The aim of this document is to derive and present a multi-step method that converts the raw personal environmental data into an analysis-ready dataset. For this, I selected the data of 5 participants: ACT001D (very good data on visual inspection), ACT004S and ACT014F (some poor data), and ACT003C and ACT032V (very poor data). The variables to be cleaned are temperature, relative humidity (RH), and noise.
Main reasons for cleaning are:
The steps of cleaning after revision are the following:
First, data must be excluded that were recorded outside the observation window or during logged personal visits in which a device was changed. The data were already cut to the observation window during data compilation, but the check for device changes is done here. No additional values were excluded for the 5 chosen individuals.
Every variable (temperature, RH, noise) has physical limits. Values outside these limits are flagged as follows:
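The device-change check could be sketched as follows. This is a minimal base R illustration, not the procedure actually used: the column names `datetime`, `start`, `end`, `device_changed` and `VISIT_FLAG`, the helper `flag_visit_times()`, and the example values are all assumptions made for the example.

```r
# Hypothetical readings: one value every 5 minutes (column names are assumptions)
readings <- data.frame(
  datetime = as.POSIXct("2024-01-01 10:00:00", tz = "UTC") +
    seq(0, by = 300, length.out = 6),
  IBW_TEMP = c(31.2, 31.4, 25.0, 24.8, 31.1, 31.3)
)

# Hypothetical visit log: one visit during which the device was changed
visits <- data.frame(
  start = as.POSIXct("2024-01-01 10:10:00", tz = "UTC"),
  end = as.POSIXct("2024-01-01 10:15:00", tz = "UTC"),
  device_changed = TRUE
)

# Return 1 for readings taken during a visit with a device change, 0 otherwise
flag_visit_times <- function(datetime, visits) {
  changed <- visits[visits$device_changed, , drop = FALSE]
  flag <- rep(0, length(datetime))
  for (i in seq_len(nrow(changed))) {
    in_visit <- datetime >= changed$start[i] & datetime <= changed$end[i]
    flag[in_visit] <- 1
  }
  flag
}

readings$VISIT_FLAG <- flag_visit_times(readings$datetime, visits)
```

Flagging rather than deleting keeps this step consistent with the 0/1 flag columns used in the later cleaning steps, so all exclusions can be applied together at the end.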
# House
data_H <- data_H |>
  mutate(IBH_TEMP_01 = if_else(IBH_TEMP < -273, 1, 0),
         IBH_HUM_01 = if_else(IBH_HUM < 0 | IBH_HUM > 100, 1, 0))
# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_01 = if_else(IBW_TEMP < -273, 1, 0),
         IBW_HUM_01 = if_else(IBW_HUM < 0 | IBW_HUM > 100, 1, 0))
# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_01 = if_else(IBT_TEMP < -273, 1, 0))
# Noise
data_N <- data_N |>
  mutate(NS_01 = if_else(NS < 0, 1, 0))
No plots are shown here because there are no impossible values in the example data.
The plausible range is to some degree subjective: it depends on the observation surroundings and varies not only with the variable but also with what the variable describes (e.g. taped versus house temperature). We therefore define device-specific value ranges for each variable.
# House
data_H <- data_H |>
  mutate(IBH_TEMP_02 = if_else(IBH_TEMP < 0 | IBH_TEMP > 55, 1, 0))
# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_02 = if_else(IBW_TEMP < 15 | IBW_TEMP > 45, 1, 0))
# Taped
# Lower threshold: 10th percentile of the taped temperature
Q10 <- quantile(data_T$IBT_TEMP, .10)
data_T <- data_T |>
  mutate(IBT_TEMP_02 = if_else(IBT_TEMP < Q10, 1, 0))
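The range checks above all follow the same pattern, which could be captured in a small helper. The sketch below uses base R only; the function name `flag_outside()` and the example values are hypothetical and not part of the original pipeline.

```r
# Return 1 where a value lies outside [lower, upper], 0 otherwise.
# Missing values stay NA, so they are not silently counted as plausible.
flag_outside <- function(v, lower = -Inf, upper = Inf) {
  as.numeric(v < lower | v > upper)
}

# Hypothetical house temperatures, checked against the 0 to 55 range
temp_house <- c(21.5, -3.0, 27.1, 58.2, NA)
house_flag <- flag_outside(temp_house, lower = 0, upper = 55)
```

Keeping the limits as arguments makes the device-specific ranges explicit at the call site, which helps when the ranges are revisited for a different observation setting.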
The variability differs substantially between variables and devices (e.g. house versus worn humidity). Because we are interested in the stress experienced by the individuals, it is important not to filter out extreme but realistic conditions, as these represent the largest stress impact. However, we do want to filter out worn measurements whose variance resembles that of the house measurements, which indicates that the device was not worn. We use the moving standard deviation of 3 left-aligned humidity values. As an additional safeguard against filtering out reasonable values, we filter measurements only if the standard deviation has been too low for 2 consecutive measurements. We initially considered 4 values; however, in the averaging process this threshold then introduced “wrong” data back into the hourly averages.
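# Threshold for the moving standard deviation of humidity (IBW_HUM_MSD)
x <- 1
# Worn
# rollmean() comes from the zoo package
data_W <- data_W |>
  mutate(
    # 1 if the moving SD of humidity is below the threshold, 0 otherwise
    IBW_TEMP_04_intermediate = if_else(IBW_HUM_MSD < x, 1, 0),
    IBW_HUM_04_intermediate = if_else(IBW_HUM_MSD < x, 1, 0),
    # Mean of 2 consecutive flags: 1 only if both are flagged, 0.5 if only one is
    IBW_TEMP_04 = rollmean(IBW_TEMP_04_intermediate, k = 2, fill = NA, align = "left"),
    IBW_HUM_04 = rollmean(IBW_HUM_04_intermediate, k = 2, fill = NA, align = "left"))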
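To make the two-stage logic concrete, here is a minimal base R sketch of a left-aligned moving standard deviation followed by the "2 consecutive values" rule. The helper names `moving_sd_left()` and `flag_low_variability()` and the humidity values are hypothetical, and the sketch assumes that a measurement counts as filtered only when the rolling mean of the flags equals 1.

```r
# Left-aligned moving standard deviation over `width` values.
# The last (width - 1) positions have no full window and stay NA.
moving_sd_left <- function(v, width = 3) {
  n <- length(v)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in seq_len(n - width + 1)) {
      out[i] <- sd(v[i:(i + width - 1)])
    }
  }
  out
}

# Flag a position only if it and the following position are both below the threshold
flag_low_variability <- function(msd, threshold = 1) {
  low <- as.numeric(msd < threshold)
  next_low <- c(low[-1], NA)
  as.numeric(low == 1 & next_low == 1)
}

# Hypothetical worn humidity: variable at first, then nearly flat, then variable again
hum <- c(40, 45, 38, 50, 50.2, 50.1, 50.3, 50.2, 44, 39)
msd <- moving_sd_left(hum, width = 3)
not_worn <- flag_low_variability(msd, threshold = 1)
```

In this example the moving standard deviation is below the threshold at positions 4 to 6, but only positions 4 and 5 are flagged, because position 6 is not followed by another low value. A single low-variability window therefore never removes data on its own.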
The plots below show the cleaned data including all cleaning methods. Light colors indicate cleaned original data.
Note: Noise is now averaged with respect to its inherent log scale.
| Variable | Mean | Median | Standard Deviation |
|---|---|---|---|
| IBH_HUM | 40.60369 | 38.68000 | 13.4253385 |
| IBH_TEMP | 27.22120 | 27.09375 | 2.6801571 |
| IBW_HUM | 41.59404 | 41.07000 | 12.6813064 |
| IBW_TEMP | 31.08556 | 31.72396 | 3.7832304 |
| IBT_TEMP | 34.63741 | 34.68750 | 0.7487054 |
| NS | 59.15517 | 55.85660 | 62.5825075 |
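As an illustration of the note on noise averaging, the sketch below shows one common way to average on a logarithmic scale. It assumes that `NS` is a sound level in decibels, which the text does not state explicitly; the function name `mean_db()` and the example values are hypothetical, and the exact formula used in the pipeline may differ.

```r
# Average decibel values on the linear (energy) scale, then convert back to dB.
# Assumes the input is a sound level in decibels.
mean_db <- function(db) {
  10 * log10(mean(10^(db / 10), na.rm = TRUE))
}

# Hypothetical noise readings
noise <- c(50, 60)
arithmetic_mean <- mean(noise)
energy_mean <- mean_db(noise)
```

The energy-based mean is never lower than the arithmetic mean of the decibel values, because loud periods dominate the average. This matters when hourly averages are meant to reflect the noise exposure actually experienced.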